Credit Card Customer Segmentation¶
We are working for a credit card company and are given a dataset that contains information about the company’s clients. We're asked to help segment them into different groups in order to apply different business strategies for each type of customer. Possible strategies are:
- Provide higher credit limits for customers who use the card a lot but spend little money.
- Create incentives for those with high income who don't use the card as much as the company expects.
The company expects to receive a group assignment for each client, along with an explanation of each group's characteristics and the main points that differentiate the groups.
Data dictionary:
| Column | Description |
|---|---|
| customer_id | unique identifier for each customer. |
| age | customer age in years. |
| gender | customer gender (M or F). |
| dependent_count | number of dependents of each customer. |
| education_level | level of education ("High School", "Graduate", etc.). |
| marital_status | marital status ("Single", "Married", etc.). |
| estimated_income | the estimated income for the customer projected by the data science team. |
| months_on_book | time as a customer in months. |
| total_relationship_count | number of times the customer contacted the company. |
| months_inactive_12_mon | number of months the customer did not use the credit card in the last 12 months. |
| credit_limit | customer's credit limit. |
| total_trans_amount | the overall amount of money spent on the card by the customer. |
| total_trans_count | the overall number of times the customer used the card. |
| avg_utilization_ratio | daily average utilization ratio. |
Outline¶
import copy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PP_utils import data_cleaning, data_plotting, ds_preprocessing, ds_modeling
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
%matplotlib inline
plt.style.use('PP_utils/deeplearning.mplstyle')
pd.set_option('display.max_columns', 10)
from importlib import reload
# Load the dataset
df = pd.read_csv('data/customer_segmentation.csv')
df.head()
| | customer_id | age | gender | dependent_count | education_level | ... | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | 45 | M | 3 | High School | ... | 1 | 12691.0 | 1144 | 42 | 0.061 |
| 1 | 818770008 | 49 | F | 5 | Graduate | ... | 1 | 8256.0 | 1291 | 33 | 0.105 |
| 2 | 713982108 | 51 | M | 3 | Graduate | ... | 1 | 3418.0 | 1887 | 20 | 0.000 |
| 3 | 769911858 | 40 | F | 4 | High School | ... | 4 | 3313.0 | 1171 | 20 | 0.760 |
| 4 | 709106358 | 40 | M | 3 | Uneducated | ... | 1 | 4716.0 | 816 | 28 | 0.000 |
5 rows × 14 columns
df.columns
Index(['customer_id', 'age', 'gender', 'dependent_count', 'education_level',
'marital_status', 'estimated_income', 'months_on_book',
'total_relationship_count', 'months_inactive_12_mon', 'credit_limit',
'total_trans_amount', 'total_trans_count', 'avg_utilization_ratio'],
dtype='object')
Exploratory Data Analysis¶
df.shape
(10127, 14)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   customer_id               10127 non-null  int64
 1   age                       10127 non-null  int64
 2   gender                    10127 non-null  object
 3   dependent_count           10127 non-null  int64
 4   education_level           10127 non-null  object
 5   marital_status            10127 non-null  object
 6   estimated_income          10127 non-null  int64
 7   months_on_book            10127 non-null  int64
 8   total_relationship_count  10127 non-null  int64
 9   months_inactive_12_mon    10127 non-null  int64
 10  credit_limit              10127 non-null  float64
 11  total_trans_amount        10127 non-null  int64
 12  total_trans_count         10127 non-null  int64
 13  avg_utilization_ratio     10127 non-null  float64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
df.describe()
| | customer_id | age | dependent_count | estimated_income | months_on_book | ... | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.012700e+04 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | ... | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 |
| mean | 7.391776e+08 | 46.325960 | 2.346203 | 62078.206774 | 35.928409 | ... | 2.341167 | 8631.953698 | 4404.086304 | 64.858695 | 0.274894 |
| std | 3.690378e+07 | 8.016814 | 1.298908 | 39372.861291 | 7.986416 | ... | 1.010622 | 9088.776650 | 3397.129254 | 23.472570 | 0.275691 |
| min | 7.080821e+08 | 26.000000 | 0.000000 | 20000.000000 | 13.000000 | ... | 0.000000 | 1438.300000 | 510.000000 | 10.000000 | 0.000000 |
| 25% | 7.130368e+08 | 41.000000 | 1.000000 | 32000.000000 | 31.000000 | ... | 2.000000 | 2555.000000 | 2155.500000 | 45.000000 | 0.023000 |
| 50% | 7.179264e+08 | 46.000000 | 2.000000 | 50000.000000 | 36.000000 | ... | 2.000000 | 4549.000000 | 3899.000000 | 67.000000 | 0.176000 |
| 75% | 7.731435e+08 | 52.000000 | 3.000000 | 80000.000000 | 40.000000 | ... | 3.000000 | 11067.500000 | 4741.000000 | 81.000000 | 0.503000 |
| max | 8.283431e+08 | 73.000000 | 5.000000 | 200000.000000 | 56.000000 | ... | 6.000000 | 34516.000000 | 18484.000000 | 139.000000 | 0.999000 |
8 rows × 11 columns
numeric_cols = ['age', 'dependent_count', 'estimated_income', 'months_on_book',
'total_relationship_count', 'months_inactive_12_mon', 'credit_limit',
'total_trans_amount', 'total_trans_count', 'avg_utilization_ratio']
discrete_vars = ['dependent_count', 'months_on_book', 'total_relationship_count', 'months_inactive_12_mon', 'total_trans_count' ]
categorical_cols = ['gender', 'education_level', 'marital_status']
df_numeric_scaled = StandardScaler().fit_transform(df[numeric_cols])
df_numeric_scaled = pd.DataFrame(df_numeric_scaled, columns=numeric_cols)
df_numeric_scaled.head()
| | age | dependent_count | estimated_income | months_on_book | total_relationship_count | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.165406 | 0.503368 | 0.175810 | 0.384621 | 0.763943 | -1.327136 | 0.446622 | -0.959707 | -0.973895 | -0.775882 |
| 1 | 0.333570 | 2.043199 | -0.967166 | 1.010715 | 1.407306 | -1.327136 | -0.041367 | -0.916433 | -1.357340 | -0.616276 |
| 2 | 0.583058 | 0.503368 | 0.785397 | 0.008965 | 0.120579 | -1.327136 | -0.573698 | -0.740982 | -1.911206 | -0.997155 |
| 3 | -0.789126 | 1.273283 | -0.636973 | -0.241473 | -0.522785 | 1.641478 | -0.585251 | -0.951758 | -1.911206 | 1.759686 |
| 4 | -0.789126 | 0.503368 | 0.074212 | -1.869317 | 0.763943 | -1.327136 | -0.430877 | -1.056263 | -1.570365 | -0.997155 |
reload(data_cleaning)
data_cleaning.plot_numeric_distributions(df_numeric_scaled,numeric_cols)
reload(data_plotting)
data_plotting.plot_multicollinearity_checks(df[numeric_cols],pair_plot=False)
The Pearson correlation coefficient measures the linear relationship between two datasets.
Test for the statistical significance of the correlation coefficient:
- ρ = population correlation coefficient
- Null hypothesis (H0): ρ = 0, the populations are linearly uncorrelated
- Alternative hypothesis (H1): ρ ≠ 0, the populations are linearly correlated
- p-value < 0.05 => reject H0
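As a quick illustration of this test (not part of the project utilities), `scipy.stats.pearsonr` returns both the coefficient and the p-value; the data here is synthetic:

```python
import numpy as np
from scipy import stats

# synthetic stand-in: a noisy linear relationship between two variables
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.2e}")
if p_value < 0.05:
    print("Reject H0: evidence of a linear correlation")
```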
Each independent variable is regressed against the rest to calculate its VIF (Variance Inflation Factor).
VIF for each independent variable:
| | VIF | feature | result |
|---|---|---|---|
| 0 | 2.7 | age | Moderately correlated |
| 1 | 1.0 | dependent_count | Not correlated |
| 2 | 1.4 | estimated_income | Moderately correlated |
| 3 | 2.7 | months_on_book | Moderately correlated |
| 4 | 1.1 | total_relationship_count | Moderately correlated |
| 5 | 1.0 | months_inactive_12_mon | Not correlated |
| 6 | 1.7 | credit_limit | Moderately correlated |
| 7 | 3.2 | total_trans_amount | Moderately correlated |
| 8 | 3.0 | total_trans_count | Moderately correlated |
| 9 | 1.3 | avg_utilization_ratio | Moderately correlated |
Here is how VIF can be interpreted:
- 1 = Not correlated
- 1–5 = Moderately correlated
- >5 = Highly correlated
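The VIF figures above come from a project utility; the underlying computation can be sketched with plain scikit-learn. The `compute_vif` helper and the toy frame below are illustrative, not the project's code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_vif(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    feature j on all the other features."""
    scores = {}
    for col in df.columns:
        X, y = df.drop(columns=[col]), df[col]
        r2 = LinearRegression().fit(X, y).score(X, y)
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores, name='VIF')

# toy frame: 'c' is built to be collinear with 'a'; 'b' is independent
rng = np.random.default_rng(1)
demo = pd.DataFrame({'a': rng.normal(size=300), 'b': rng.normal(size=300)})
demo['c'] = 0.9 * demo['a'] + rng.normal(scale=0.3, size=300)
result = compute_vif(demo)
print(result.round(2))  # 'a' and 'c' get high VIFs, 'b' stays near 1
```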
data_cleaning.plot_categorical_bars(df,categorical_cols)
EDA observations:
- There are no missing values. The dataset has 14 features and 10,127 observations, 3 of which are categorical: ['gender', 'education_level', 'marital_status']
- There is a strong correlation between the months_on_book and age features
- total_trans_count and total_trans_amount are also highly correlated
- avg_utilization_ratio and credit_limit have a non-linear relationship
- The dataset has a roughly balanced proportion of the two genders.
Prepare the data for modeling¶
- customer_id: this is the unique identifier for each customer and shouldn't be used for machine learning, so we need to drop it.
- Categorical columns need to be encoded.
- The gender column contains only 2 unique values, "M" and "F", so we can replace them with 1 and 0.
- The education_level column contains 6 unique values. This column represents levels of education, and it's possible to rank the levels from the lowest to the highest. Ordinal encoding is particularly beneficial for clustering algorithms like K-means, which rely on distance calculations: assigning numerical values that reflect the natural progression of education levels lets the algorithm capture the underlying structure in the data, instead of treating ordered categories as unrelated. It is important, however, that the ordinal mapping accurately reflects the real-world progression and that the resulting values do not distort the distance metric.
- The marital_status column has 4 unique values. We will use one-hot encoding to create dummy variables for this column.
- All features need to be scaled, not just the numeric columns.
- For clustering, StandardScaler() is generally preferred over MinMaxScaler(): it centers each feature at zero with unit variance, and its output range is not dictated entirely by the extreme values the way MinMaxScaler's is, which matters for distance-based algorithms like K-means.
- One-hot encoded columns are often left unscaled for tree-based classification models, but clustering algorithms like K-means are fundamentally different: they compute Euclidean distances between data points, so scaling ensures that all features contribute comparably to the clustering process.
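To make the distance argument concrete, here is a small synthetic illustration: before scaling, an income difference in dollars swamps a 0/1 gender flag; after StandardScaler, both features contribute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# three toy customers: income in dollars vs. a 0/1 encoded gender flag
X = np.array([
    [30000.0, 1.0],
    [80000.0, 0.0],
    [50000.0, 1.0],
])

raw_dist = np.linalg.norm(X[0] - X[1])          # ~50000: income swamps gender
X_std = StandardScaler().fit_transform(X)
std_dist = np.linalg.norm(X_std[0] - X_std[1])  # both features now contribute
print(raw_dist, std_dist)
```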
df_encoded = copy.deepcopy(df) # create a copy before we do any modifications
# drop customer_id
df_encoded = df_encoded.drop(columns=['customer_id'])
# encode gender column
df_encoded['gender'] = df_encoded['gender'].map({'M':1, 'F':0})
# make the education_level ordinal
education_mapping = {
'Uneducated': 0,
'High School': 1,
'College': 2,
'Graduate': 3,
'Post-Graduate': 4,
'Doctorate': 5
}
df_encoded['education_level'] = df_encoded['education_level'].map(education_mapping)
# get dummies for marital_status
# it is not a good idea to use drop_first=True in one-hot encoding for clustering
df_encoded = pd.get_dummies(df_encoded, columns=['marital_status'], prefix='marital', sparse=False, drop_first=False, dtype=int)
df_encoded.head()
| | age | gender | dependent_count | education_level | estimated_income | ... | avg_utilization_ratio | marital_Divorced | marital_Married | marital_Single | marital_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 1 | 3 | 1 | 69000 | ... | 0.061 | 0 | 1 | 0 | 0 |
| 1 | 49 | 0 | 5 | 3 | 24000 | ... | 0.105 | 0 | 0 | 1 | 0 |
| 2 | 51 | 1 | 3 | 3 | 93000 | ... | 0.000 | 0 | 1 | 0 | 0 |
| 3 | 40 | 0 | 4 | 1 | 37000 | ... | 0.760 | 0 | 0 | 0 | 1 |
| 4 | 40 | 1 | 3 | 0 | 65000 | ... | 0.000 | 0 | 1 | 0 | 0 |
5 rows × 16 columns
df_encoded_cols = df_encoded.columns.to_list()
df_encoded_cols
['age', 'gender', 'dependent_count', 'education_level', 'estimated_income', 'months_on_book', 'total_relationship_count', 'months_inactive_12_mon', 'credit_limit', 'total_trans_amount', 'total_trans_count', 'avg_utilization_ratio', 'marital_Divorced', 'marital_Married', 'marital_Single', 'marital_Unknown']
# scale all columns using StandardScaler()
transformer = [
('num', StandardScaler(), df_encoded_cols)
]
# following util applies a list of transformers to specified columns, leaves unspecified columns unchanged
df_scaled,feature_names = ds_preprocessing.transform_cols_and_reconstuct_df(df_encoded, transformer)
df_scaled.head()
| | age | gender | dependent_count | education_level | estimated_income | ... | avg_utilization_ratio | marital_Divorced | marital_Married | marital_Single | marital_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.165406 | 1.059956 | 0.503368 | -0.752211 | 0.175810 | ... | -0.775882 | -0.282405 | 1.077338 | -0.798507 | -0.282609 |
| 1 | 0.333570 | -0.943436 | 2.043199 | 0.662787 | -0.967166 | ... | -0.616276 | -0.282405 | -0.928214 | 1.252337 | -0.282609 |
| 2 | 0.583058 | 1.059956 | 0.503368 | 0.662787 | 0.785397 | ... | -0.997155 | -0.282405 | 1.077338 | -0.798507 | -0.282609 |
| 3 | -0.789126 | -0.943436 | 1.273283 | -0.752211 | -0.636973 | ... | 1.759686 | -0.282405 | -0.928214 | -0.798507 | 3.538459 |
| 4 | -0.789126 | 1.059956 | 0.503368 | -1.459710 | 0.074212 | ... | -0.997155 | -0.282405 | 1.077338 | -0.798507 | -0.282609 |
5 rows × 16 columns
df_scaled.columns
Index(['age', 'gender', 'dependent_count', 'education_level',
'estimated_income', 'months_on_book', 'total_relationship_count',
'months_inactive_12_mon', 'credit_limit', 'total_trans_amount',
'total_trans_count', 'avg_utilization_ratio', 'marital_Divorced',
'marital_Married', 'marital_Single', 'marital_Unknown'],
dtype='object')
Feature Selection¶
There is a strong correlation between following set of features:
- months_on_book and age
- total_trans_count and total_trans_amount
While clustering algorithms like k-means are not fundamentally broken by correlated features, including highly correlated variables can disproportionately influence distance calculations, effectively giving them more weight in cluster formation. This redundancy can lead to biased results and increased computational overhead, especially in high-dimensional datasets.
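The double-weighting effect is easy to see in the extreme case: appending an exact copy of a feature doubles its contribution to squared Euclidean distance, and highly correlated features do the same thing in softer form. A tiny synthetic check:

```python
import numpy as np

# two points that differ only in feature x
a = np.array([0.0, 1.0])
b = np.array([3.0, 1.0])

d_once = np.sum((a - b) ** 2)                    # x contributes 9.0
# append an exact copy of x: its contribution doubles
a2, b2 = np.append(a, a[0]), np.append(b, b[0])
d_twice = np.sum((a2 - b2) ** 2)                 # 18.0
print(d_once, d_twice)
```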
So in order to avoid any biased results, we will drop one of the features from each set. Before we do that, let's visualize hierarchical clustering on a correlation matrix using dendrogram.
data_plotting.plot_corr_dendrogram(df_scaled[numeric_cols])
(<Figure size 1200x800 with 3 Axes>,
(<Axes: title={'center': 'Hierarchical Clustering Dendrogram'}, ylabel='Ward Linkage Distance'>,
<Axes: title={'center': 'Reordered Correlation Matrix'}>),
array([[ 7. , 8. , 0.12027459, 2. ],
[ 0. , 3. , 0.23109905, 2. ],
[ 2. , 6. , 0.50723025, 2. ],
[ 9. , 12. , 0.72288993, 3. ],
[ 4. , 10. , 0.8603379 , 3. ],
[ 1. , 5. , 0.99082591, 2. ],
[11. , 15. , 1.06853755, 4. ],
[14. , 16. , 1.35159687, 7. ],
[13. , 17. , 1.40486505, 10. ]]))
The dendrogram confirms our previous observation, so we will select one representative from each cluster based on domain relevance.
Given the business strategies:
- Provide higher credit limits for customers that use the card a lot, but spend little money
- Create incentives for those with high income who don't use the card as much as expected
The more domain-relevant feature between age and months_on_book is months_on_book. Here is why:
- months_on_book reflects customer tenure — how long a customer has been with the bank. This directly relates to behavioral history and engagement, which are critical for identifying underutilized high-income customers or frequent low-spenders.
- While age is correlated with months_on_book (often younger customers are newer), tenure is more actionable for segmentation because it captures relationship length and loyalty, which aligns better with credit limit decisions and incentive targeting.
- For strategy #2, identifying high-income customers who don’t use the card much benefits from knowing how long they’ve been customers — a long-tenured but inactive high-income customer may warrant different incentives than a new one.
Between total_trans_amount and total_trans_count, the more relevant is total_trans_count. Here is why:
- total_trans_count captures how frequently a customer uses the card, directly aligning with usage behavior.
- Strategy #1 targets high-frequency, low-spend users; we need the transaction count to identify high-frequency users.
- Strategy #2 focuses on low usage among high-income customers; again, we need total_trans_count to find low-frequency users.
So let's drop 'age' and 'total_trans_amount'.
# drop age and total_trans_amount
df_select = df_scaled.drop(columns=['age', 'total_trans_amount'])
df_select.columns
Index(['gender', 'dependent_count', 'education_level', 'estimated_income',
       'months_on_book', 'total_relationship_count', 'months_inactive_12_mon',
       'credit_limit', 'total_trans_count', 'avg_utilization_ratio',
       'marital_Divorced', 'marital_Married', 'marital_Single',
       'marital_Unknown'],
      dtype='object')
Now our dataset is ready for machine learning. The next step is to decide the best number of clusters to use in the segmentation.
Selecting Number of Clusters: Elbow Curve and Silhouette analysis¶
We will use the elbow curve and silhouette analysis for this, with the K-Means algorithm to create the clusters. The elbow curve is a good starting point, but it is important to keep the business strategies in mind; they will help with this decision. Here are the strategies of our company:
- High usage vs. low spending → Need to look for clusters with high total_trans_count but low avg_utilization_ratio. These customers demonstrate responsible credit behavior (low default risk) and high engagement, making them ideal candidates for higher credit limits to encourage more spending and increase issuer revenue through interchange fees, without significantly increasing risk.
- High-income, low engagement → Need to identify clusters with high estimated_income, low total_trans_count, and high months_inactive_12_mon.
We will also look into:
- credit_limit and avg_utilization_ratio to assess underutilized potential.
- months_on_book to distinguish loyal vs. new customers.
- total_relationship_count to identify deep-engagement opportunities.
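The elbow computation hidden inside `clustering_plot_elbow_curve_using_KMeans` presumably amounts to fitting KMeans over a range of k and recording the inertia; a minimal sketch on synthetic blobs (not the project data):

```python
import numpy as np
from sklearn.cluster import KMeans

# stand-in for df_select: three well-separated 2-D blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 4, 8)])

inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares
for k, v in inertias.items():
    print(k, round(v, 1))  # the drop flattens after k=3: the "elbow"
```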
reload(data_plotting)
labels_dict = data_plotting.clustering_plot_elbow_curve_using_KMeans(df=df_select,scale=False,max_k=10)
data_plotting.clustering_silhouette_analysis(df_select,labels_dict)
Silhouette coefficient interpretation:
- Near +1: Well-clustered
- Near 0: On the boundary
- Negative: Likely wrong cluster
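The coefficient itself is available directly from scikit-learn; a minimal sketch on synthetic, well-separated data (not the project data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# two clearly separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(80, 2)) for c in (0, 5)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
overall = silhouette_score(X, labels)       # mean coefficient over all points
per_point = silhouette_samples(X, labels)   # one value per observation
print(round(overall, 3))  # close to +1 for well-separated blobs
```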
ds_modeling.clustering_performance_evaluation(df_select,labels_dict,plot_silhouette_analysis=False)
--- Interpretation ---
• Silhouette Coefficient: Higher is better (close to 1).
• Calinski-Harabasz Index: Higher is better.
• Dunn Index: Higher is better.
• Davies-Bouldin Index: Lower is better (closer to 0).
| | Model | Silhouette | Calinski-Harabasz | Dunn Index | Davies-Bouldin |
|---|---|---|---|---|---|
| 0 | 2 | 0.113 | 1195.650 | 0.050 | 2.801 |
| 1 | 3 | 0.110 | 1097.774 | 0.039 | 2.457 |
| 2 | 4 | 0.138 | 1082.395 | 0.073 | 2.096 |
| 3 | 5 | 0.139 | 1073.506 | 0.043 | 1.981 |
| 4 | 6 | 0.160 | 1133.327 | 0.045 | 1.765 |
| 5 | 7 | 0.154 | 1083.134 | 0.090 | 1.901 |
| 6 | 8 | 0.147 | 1031.277 | 0.060 | 1.803 |
| 7 | 9 | 0.138 | 982.642 | 0.034 | 1.854 |
| 8 | 10 | 0.136 | 924.556 | 0.041 | 1.838 |
The best candidate here is six clusters: k = 6 has the highest silhouette score and the lowest Davies-Bouldin index. Let's look at that option and interpret the results.
Interpreting Results - Numeric Variables¶
Now, we will interpret the results and summarize the characteristics of each cluster and differentiate them from each other based on the variables used for the segmentation.
First, we'll analyze the numerical variables and see how they behave in each cluster.
# for 6 clusters
labels_k_5 = labels_dict[6]
df_with_k_5 = df.copy()
df_with_k_5['cluster'] = labels_k_5
df_with_k_5['cluster'].value_counts()
cluster
0    3260
2    2884
4    1638
1     867
3     740
5     738
Name: count, dtype: int64
We can see that cluster 0 is the largest while cluster 5 is the smallest.
Considering the numeric variables only, we'll check the average value of each variable per cluster. We just need to group the data and plot a bar chart for each column.
numeric_columns = df_with_k_5.select_dtypes(include=np.number).drop(['customer_id', 'cluster'], axis=1).columns
reload(data_plotting)
data_plotting.plot_mean_by_group(df_with_k_5,'cluster',numeric_columns)
For the numerical variable pairs with the higher correlations we saw earlier, we can also use a scatter plot to visualize the correlation grouped by cluster and analyze how the clusters change across each area of the chart.
correlated_col_pairs = [('age', 'months_on_book'),('total_trans_count','total_trans_amount'),('credit_limit','avg_utilization_ratio'),('estimated_income','credit_limit')]
reload(data_plotting)
data_plotting.plot_scatter_by_group(df_with_k_5,correlated_col_pairs,'cluster')
We can draw some early conclusions considering only the numeric variables.
For instance, Cluster 1 has the highest amount of money transacted, while Cluster 4 has the highest income.
For the categorical columns, we'll plot the percentage distribution of each variable in each cluster. This will allow us to verify if a particular cluster is mostly composed of men, or of married people only, for example.
cat_columns = df_with_k_5.select_dtypes(include=['object']).columns
reload(data_plotting)
cross_tabs = data_plotting.create_and_plot_crosstab(df_with_k_5,'cluster',categorical_cols)
Considering the categorical variables, we notice that the education level is well divided between clusters. Cluster 0 is composed almost entirely of married people, while we don't know the marital status of anybody in Cluster 3. Cluster 4 is almost completely male and Cluster 2 is 100% made of single people.
Let's list the most important characteristics of each cluster and also make some suggestions and share our insights for each one of them.
Cluster 0¶
Characteristics: Mostly women; mostly married; low estimated income; low credit limit; low amount transacted; high utilization rate.
Insight: Married people (majority women) with low income and credit limits who nevertheless use a large share of their credit on a few larger purchases.
Cluster 1¶
Characteristics: Mostly men; mostly single and married; high credit limit; high amount transacted; high number of transactions; low utilization rate; low total relationship count.
Insight: People with high volume spent on the card, but low utilization. Could be incentivized to spend more and targeted for more engagement.
Cluster 2¶
Characteristics: Mostly women; 100% single; low estimated income; low credit limit; low amount transacted; high utilization rate.
Insight: Single (mostly women) people that use their card a lot but have low credit limits and income. Could be given a bit more credit limit.
Cluster 3¶
Characteristics: Gender well divided; low credit limit, high utilization rate; marital status 100% unknown; smaller cluster.
Insight: Men and women with low credit limits but high balances.
Cluster 4¶
Characteristics: Mostly men; mostly single and married; high estimated income; high credit limit; low amount transacted; low utilization rate.
Insight: People (majority men) with high income and credit limits, but don't use the card. Could be incentivized to use it.
Cluster 5¶
Characteristics: Gender well divided; 100% divorced; low credit limit; low amount transacted; high utilization rate.
Insight: Low credit limit and transactions, but use the card very often. Could receive benefits to spend more money.
Feature Engineering: Create interaction features¶
Let's explore adding interaction features to see how they work.
df_feature_eng = df_encoded.copy()
#For Strategy 1
# Transactions per unit of spend (high = frequent use, low amounts)
df_feature_eng['trans_per_balance'] = df_feature_eng['total_trans_count'] / (df_feature_eng['total_trans_amount'] + 1)
# High frequency, low spending ratio
df_feature_eng['high_freq_low_spend'] = df_feature_eng['total_trans_count'] / (df_feature_eng['avg_utilization_ratio'] + 0.001)
# Combine with credit limit: high usage frequency relative to limit
df_feature_eng['trans_count_to_limit'] = df_feature_eng['total_trans_count'] / (df_feature_eng['credit_limit'] + 1)
# Unified Transactor Score
df_feature_eng['transactor_score'] = (
df_feature_eng['total_trans_count']
/ ((df_feature_eng['total_trans_amount'] + 1) *
(df_feature_eng['avg_utilization_ratio'] + 0.001) *
(df_feature_eng['credit_limit'] + 1))
)
# For Strategy 2
# High income relative to utilization (high value = underused potential)
df_feature_eng['income_to_utilization'] = df_feature_eng['estimated_income'] / (df_feature_eng['avg_utilization_ratio'] + 0.001)
# High income, low transaction activity
df_feature_eng['income_per_transaction'] = df_feature_eng['estimated_income'] / (df_feature_eng['total_trans_count'] + 1)
# Inactivity boost: high income, inactive recently
df_feature_eng['income_inactive_score'] = df_feature_eng['estimated_income'] * df_feature_eng['months_inactive_12_mon']
# Combined underuse score
df_feature_eng['high_income_low_use'] = df_feature_eng['estimated_income'] / ((df_feature_eng['avg_utilization_ratio'] + 0.001) *
(df_feature_eng['total_trans_count'] + 1) *
(df_feature_eng['months_inactive_12_mon'] + 1))
df_feature_eng_cols = df_feature_eng.columns.to_list()
df_feature_eng_cols
['age', 'gender', 'dependent_count', 'education_level', 'estimated_income', 'months_on_book', 'total_relationship_count', 'months_inactive_12_mon', 'credit_limit', 'total_trans_amount', 'total_trans_count', 'avg_utilization_ratio', 'marital_Divorced', 'marital_Married', 'marital_Single', 'marital_Unknown', 'trans_per_balance', 'high_freq_low_spend', 'trans_count_to_limit', 'transactor_score', 'income_to_utilization', 'income_per_transaction', 'income_inactive_score', 'high_income_low_use']
# scale all columns using StandardScaler()
transformer = [
('num', StandardScaler(), df_feature_eng_cols)
]
# following util applies a list of transformers to specified columns, leaves unspecified columns unchanged
df_scaled,feature_names = ds_preprocessing.transform_cols_and_reconstuct_df(df_feature_eng, transformer)
df_scaled.head()
| | age | gender | dependent_count | education_level | estimated_income | ... | transactor_score | income_to_utilization | income_per_transaction | income_inactive_score | high_income_low_use |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.165406 | 1.059956 | 0.503368 | -0.752211 | 0.175810 | ... | -0.398840 | -0.438375 | 0.467513 | -0.646912 | -0.329816 |
| 1 | 0.333570 | -0.943436 | 2.043199 | 0.662787 | -0.967166 | ... | -0.403933 | -0.466069 | -0.419335 | -1.028807 | -0.369825 |
| 2 | 0.583058 | 1.059956 | 0.503368 | 0.662787 | 0.785397 | ... | 0.492742 | 2.432172 | 3.253978 | -0.443235 | 8.834016 |
| 3 | -0.789126 | -0.943436 | 1.273283 | -0.752211 | -0.636973 | ... | -0.410482 | -0.471624 | 0.622681 | 0.023526 | -0.381758 |
| 4 | -0.789126 | 1.059956 | 0.503368 | -1.459710 | 0.074212 | ... | 1.710228 | 1.557454 | 1.095796 | -0.680858 | 4.281558 |
5 rows × 24 columns
numeric_cols = ['age',
'dependent_count',
'estimated_income',
'months_on_book',
'total_relationship_count',
'months_inactive_12_mon',
'credit_limit',
'total_trans_amount',
'total_trans_count',
'avg_utilization_ratio',
'trans_per_balance',
'high_freq_low_spend',
'trans_count_to_limit',
'transactor_score',
'income_to_utilization',
'income_per_transaction',
'income_inactive_score',
'high_income_low_use']
Feature Selection - using interaction features¶
data_plotting.plot_corr_dendrogram(df_scaled[numeric_cols])
(<Figure size 1200x800 with 3 Axes>,
(<Axes: title={'center': 'Hierarchical Clustering Dendrogram'}, ylabel='Ward Linkage Distance'>,
<Axes: title={'center': 'Reordered Correlation Matrix'}>),
array([[1.40000000e+01, 1.70000000e+01, 3.20018511e-02, 2.00000000e+00],
[9.00000000e+00, 1.10000000e+01, 4.54978845e-02, 2.00000000e+00],
[6.00000000e+00, 1.20000000e+01, 8.97934499e-02, 2.00000000e+00],
[7.00000000e+00, 8.00000000e+00, 1.20274588e-01, 2.00000000e+00],
[1.80000000e+01, 1.90000000e+01, 1.50886769e-01, 4.00000000e+00],
[2.00000000e+00, 1.50000000e+01, 1.57455919e-01, 2.00000000e+00],
[0.00000000e+00, 3.00000000e+00, 2.31099049e-01, 2.00000000e+00],
[1.60000000e+01, 2.30000000e+01, 3.11988081e-01, 3.00000000e+00],
[1.00000000e+01, 2.10000000e+01, 4.77219143e-01, 3.00000000e+00],
[1.30000000e+01, 2.20000000e+01, 4.93081169e-01, 5.00000000e+00],
[2.00000000e+01, 2.50000000e+01, 8.24037362e-01, 5.00000000e+00],
[4.00000000e+00, 2.60000000e+01, 8.69200002e-01, 4.00000000e+00],
[1.00000000e+00, 5.00000000e+00, 9.90825909e-01, 2.00000000e+00],
[2.40000000e+01, 3.00000000e+01, 1.06853755e+00, 4.00000000e+00],
[2.70000000e+01, 2.80000000e+01, 1.43999377e+00, 1.00000000e+01],
[2.90000000e+01, 3.10000000e+01, 1.47360442e+00, 8.00000000e+00],
[3.20000000e+01, 3.30000000e+01, 1.77882000e+00, 1.80000000e+01]]))
- High usage vs. low spending → Need to look for clusters with high transactor_score, trans_per_balance, high trans_count_to_limit
- High-income, low engagement → Need to identify clusters with high income_inactive_score and high income_to_utilization
Drop highly correlated features¶
cols_to_drop = [
'age',
'total_trans_amount',
'estimated_income',
'income_per_transaction',
'credit_limit',
'high_freq_low_spend',
'avg_utilization_ratio',
'income_to_utilization'
]
df_select = df_scaled.drop(columns=cols_to_drop)
Selecting number of clusters - using interaction features¶
labels_dict = data_plotting.clustering_plot_elbow_curve_using_KMeans(df=df_select,scale=False,max_k=10)
ds_modeling.clustering_performance_evaluation(df_select,labels_dict,plot_silhouette_analysis=False)
--- Interpretation ---
• Silhouette Coefficient: Higher is better (close to 1).
• Calinski-Harabasz Index: Higher is better.
• Dunn Index: Higher is better.
• Davies-Bouldin Index: Lower is better (closer to 0).
| | Model | Silhouette | Calinski-Harabasz | Dunn Index | Davies-Bouldin |
|---|---|---|---|---|---|
| 0 | 2 | 0.119 | 1210.048 | 0.147 | 2.746 |
| 1 | 3 | 0.122 | 1008.041 | 0.050 | 2.451 |
| 2 | 4 | 0.148 | 1047.555 | 0.050 | 2.091 |
| 3 | 5 | 0.114 | 918.182 | 0.026 | 2.171 |
| 4 | 6 | 0.145 | 1008.264 | 0.027 | 1.913 |
| 5 | 7 | 0.144 | 972.044 | 0.027 | 2.024 |
| 6 | 8 | 0.132 | 865.769 | 0.023 | 2.163 |
| 7 | 9 | 0.135 | 888.817 | 0.019 | 1.902 |
| 8 | 10 | 0.145 | 905.729 | 0.029 | 1.806 |
data_plotting.clustering_silhouette_analysis(df_select,labels_dict)
Silhouette coefficient interpretation:
- Near +1: Well-clustered
- Near 0: On the boundary
- Negative: Likely wrong cluster
Interpreting Results - using interaction features¶
# for 4 clusters
labels_k_5 = labels_dict[4]
df_with_k_5 = df_feature_eng.copy()
df_with_k_5['cluster'] = labels_k_5
We apply a log transformation only to heavily skewed numerical features, such as the income-based interaction scores below. We use log(1 + x) rather than log(x) to handle zero values (np.log1p does this).
df_with_k_5['log_income_inactive_score'] = np.log1p(df_with_k_5['income_inactive_score'])
#df_with_k_5['avg_utilization_ratio'] = df_with_k_5['avg_utilization_ratio'] * 100
# the following are scaled by 1000 (not log-transformed) purely so they read well on a shared bar chart
df_with_k_5['trans_per_balance'] = df_with_k_5['trans_per_balance'] * 1000
df_with_k_5['trans_count_to_limit'] = df_with_k_5['trans_count_to_limit'] * 1000
df_with_k_5['log_transactor_score'] = df_with_k_5['transactor_score'] * 1000  # also just scaled; name kept for the plot labels below
df_with_k_5['log_high_income_low_use'] = np.log1p(df_with_k_5['high_income_low_use'])
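A quick sanity check (on synthetic values) of why log1p is preferred over a plain log for features that contain zeros:

```python
import numpy as np

vals = np.array([0.0, 9.0, 99.0, 9999.0])  # right-skewed, includes a zero
logged = np.log1p(vals)  # log(1 + x): finite at x = 0, unlike np.log
print(logged)            # the 4-orders-of-magnitude spread collapses to ~0-9
```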
num_cols = [
'dependent_count',
'months_on_book',
'total_relationship_count',
'months_inactive_12_mon',
'total_trans_count',
'trans_per_balance',
'trans_count_to_limit',
'transactor_score',
'income_inactive_score',
'high_income_low_use']
interaction_features_all = [
'trans_per_balance',
'log_transactor_score',
'trans_count_to_limit',
'log_high_income_low_use',
'log_income_inactive_score'
]
# interaction_features_strategy1 = [
# 'trans_per_balance',
# 'log_transactor_score',
# 'trans_count_to_limit',
# ]
# interaction_features_strategy2 = [
# 'log_high_income_low_use',
# 'log_income_inactive_score'
# ]
mean_values_by_cluster = df_with_k_5.groupby('cluster')[interaction_features_all].mean()
mean_values_by_cluster.plot(kind='bar', colormap='tab10', figsize=(10, 6))
plt.xticks(rotation=0)
plt.show()
- High usage vs. low spending → Need to look for clusters with high transactor_score, high trans_per_balance and high trans_count_to_limit
- High-income, low engagement → Need to identify clusters with high income_inactive_score and high income_to_utilization
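Both criteria reduce to ranking cluster means of the relevant interaction features. A minimal sketch on toy data (the real frame is `df_with_k_5`; column names match, values are random):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'cluster': rng.integers(0, 4, size=400),
    'transactor_score': rng.random(400),
    'trans_per_balance': rng.random(400),
    'income_inactive_score': rng.random(400),
})
means = toy.groupby('cluster').mean()

# Strategy 1 candidate: cluster with the highest mean usage-vs-spend scores
strategy1 = means[['transactor_score', 'trans_per_balance']].mean(axis=1).idxmax()
# Strategy 2 candidate: cluster with the highest mean income-inactivity score
strategy2 = means['income_inactive_score'].idxmax()
print(strategy1, strategy2)
```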
Both transactor_score (Strategy 1) and high_income_low_use (Strategy 2) peak in cluster 2, i.e., the two target groups end up in the same cluster. This indicates overlapping behavioral patterns.
# labels for the 4-cluster solution
labels_k_selected = labels_dict[4]
df_with_k_selected = df.copy()
df_with_k_selected['cluster'] = labels_k_selected
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 8))
sns.scatterplot(x='age', y='months_on_book', hue='cluster', data=df_with_k_selected, palette='tab10', alpha=0.4, ax=ax1)
sns.scatterplot(x='estimated_income', y='credit_limit', hue='cluster', data=df_with_k_selected, palette='tab10', alpha=0.4, ax=ax2, legend=False)
sns.scatterplot(x='credit_limit', y='avg_utilization_ratio', hue='cluster', data=df_with_k_selected, palette='tab10', alpha=0.4, ax=ax3)
sns.scatterplot(x='total_trans_count', y='total_trans_amount', hue='cluster', data=df_with_k_selected, palette='tab10', alpha=0.4, ax=ax4, legend=False)
plt.tight_layout()
plt.show()
Interpreting results - categorical columns¶
categorical_cols = ['gender', 'education_level', 'marital_status']
from importlib import reload
reload(data_plotting)
cross_tabs = data_plotting.create_and_plot_crosstab(df_with_k_selected,'cluster',categorical_cols)
We can observe from the bar plots above:
- education_level: Cluster 2 has proportions similar to the other clusters.
- marital_status: Clusters 0 and 1 each contain a single class, while Cluster 2 contains all 4 classes.
- gender: Cluster 2 has more females than males, while the other clusters are roughly balanced.
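`create_and_plot_crosstab` is a project helper; the underlying operation is a row-normalized crosstab, roughly like this sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 2, 2, 2],
    'gender':  ['F', 'F', 'M', 'M', 'M', 'F', 'F', 'M'],
})
# normalize='index' turns counts into within-cluster proportions,
# so clusters of different sizes become directly comparable
ct = pd.crosstab(toy['cluster'], toy['gender'], normalize='index')
print(ct)
```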
Conclusion¶
We first tried clustering without any engineered features. With this approach we found clusters with the following characteristics:
Cluster 0¶
Characteristics: Mostly women; mostly married; low estimated income; low credit limit; low total transaction amount; high utilization rate. Insight: Married customers (mostly women) with low income and credit limits who nevertheless use a large share of their credit, often via a few larger purchases.
Cluster 1¶
Characteristics: Mostly men; mostly single or married; high credit limit; high total transaction amount; high number of transactions; low utilization rate; low total relationship count. Insight: Customers who spend a high volume on the card but keep utilization low. Could be incentivized to spend more and targeted for more engagement.
Cluster 2¶
Characteristics: Mostly women; 100% single; low estimated income; low credit limit; low total transaction amount; high utilization rate. Insight: Single customers (mostly women) who use their card heavily but have low credit limits and income. Could be granted a somewhat higher credit limit.
Cluster 3¶
Characteristics: Gender evenly split; low credit limit; high utilization rate; marital status 100% unknown; smaller cluster. Insight: Customers with low credit limits but high balances relative to those limits.
Cluster 4¶
Characteristics: Mostly men; mostly single or married; high estimated income; high credit limit; low total transaction amount; low utilization rate. Insight: Customers (mostly men) with high income and credit limits who barely use the card. Could be incentivized to use it.
Cluster 5¶
Characteristics: Gender evenly split; 100% divorced; low credit limit; low total transaction amount; high utilization rate. Insight: Customers with low credit limits and transaction amounts who nevertheless use the card very often. Could receive benefits to spend more money.
Then we introduced interaction features defined for each strategy into the dataset. With this feature engineering we were able to segment customers for both strategies (high usage/low spending and high income/low engagement), but the two sets of customers cluster together. To further improve the segmentation and separate them, we can try the following:
- Use Gaussian Mixture Models (GMM) instead of KMeans — it allows soft assignments and better handles overlapping tendencies.
- Cluster each strategy separately by filtering data and applying focused clustering.
- Increase the feature weighting of key differentiators (e.g., scale avg_utilization_ratio or months_inactive_12_mon) to force separation.
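A hedged sketch of the first suggestion, on synthetic overlapping blobs: `GaussianMixture` returns soft assignments, so borderline customers can be flagged instead of being forced into a single cluster as KMeans does.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# overlapping blobs mimic the two strategy groups clustering together
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=2.5, random_state=0)

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
proba = gmm.predict_proba(X)           # one row per sample, one column per component
hard = proba.argmax(axis=1)            # hard labels, comparable to KMeans output
ambiguous = proba.max(axis=1) < 0.8    # samples with no clearly dominant component
print(f"{ambiguous.sum()} of {len(X)} samples are ambiguous")
```

Flagged samples could then be handled separately, e.g., excluded from either strategy or targeted with a blended offer.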
Author¶
Puneet Pawar